Causal connections between socioeconomic disparities and COVID-19 in the USA

With the increasing use of machine learning models in computational socioeconomics, methods for explaining these models and understanding the underlying causal connections are gaining importance. In this work, we advocate the use of an explanatory framework from cooperative game theory augmented with do-calculus, namely causal Shapley values. Using causal Shapley values, we analyze socioeconomic disparities that have a causal link to the spread of COVID-19 in the USA. We study several phases of the disease spread to show how the causal connections change over time. We also perform a causal analysis using random effects models and discuss the correspondence between the two methods to verify our results. We show the distinct advantages that non-linear machine learning models have over linear models when performing a multivariate analysis, especially since machine learning models can map out non-linear correlations in the data. In addition, causal Shapley values allow the causal structure to be included in the variable importance computed for the machine learning model.

Given the complexity of the problem that we address and the relative difficulty in engineering all the variables necessary to address it, there is always a possibility that confounding variables exist in the analysis. The causal Shapley value framework, as described in the Materials and Methods section, allows for the assumption that confounding variables exist within a group of variables that are not causally connected or are at the same level in the directed acyclic graph. To test whether assuming the presence of confounding variables changes the results of our analysis, we repeated all the analyses for the confirmed case rate (two phases, three regions and six causal orderings) assuming the presence of confounding variables in all the groups. The results are shown in Figure S1 and Figure S2. Assuming the presence of confounding variables does not change the final result of the analysis at all. Hence, we can say with confidence that we can safely ignore the presence of confounding variables. That being said, one can argue that variables like the average level of education in a county are highly correlated with several of the socioeconomic metrics that we consider, and that such a variable could replace one of the metrics considered here. However, since we test the causal connections between the variables under a certain hypothesis represented by the directed acyclic graphs, studying the details of all possible hypotheses is beyond the scope of our work. We would like to motivate future work to consider different sets of variables to better understand the true causal relations in the real world.

Additional causal orderings
We have considered three causal orderings in the main text. To give some more possible scenarios for the causal dependence within the socioeconomic metrics, we propose three more causal orderings. We motivate them here and provide the results in the figures that follow.
The first additional causal ordering builds upon the first causal ordering CO#1, except that the proportion of senior citizens is no longer on equal footing with unemployment and employment, but is moved up in the causal chain. This causal ordering takes into account the fact that senior citizens who are looking for employment may not be able to find it due to a mismatch between their skills and the skills required today, so that senior citizens may be structurally unemployed [1,2].
The second additional causal ordering has income per capita, the proportion of a county working in construction, service, delivery or production, unemployment, employment, and the proportion of senior citizens on equal footing. The rationale here is that income per capita and the proportion of a county working in construction, service, delivery or production are on equal footing with unemployment and employment, since construction, service, delivery, or production jobs (a) are more likely to be temporary and (b) pay low wages, and counties with people working in those job profiles tend to have fewer permanent jobs and a higher unemployment rate due to the decline of manufacturing in the last half century [3].
The third additional causal ordering has the proportion of senior citizens causing the proportion of non-whites, and is otherwise the same as the first causal ordering CO#1. Since senior citizens are less likely to be non-white, a higher fraction of senior citizens causes a lower proportion of non-whites [4].
The results obtained by assuming these causal orderings between the socioeconomic metrics can be found in Figure S3 and Figure S4. The results are quite similar to the ones obtained from CO#1, CO#2 and CO#3, pointing to the same sets of variables as the most causally connected ones in Phase I. In Phase II we see the diminished impact of the socioeconomic metrics in all regions except for the West Coast states.
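To illustrate why the assumed causal ordering can matter for the attributions, the following is a minimal sketch on a toy problem, not the implementation used in this work. It assumes a value function of the form v(S) = E[f(X) | do(X_S = x_S)], a hypothetical linear model f, and two standardized features with correlation rho. The function names, coefficients and data point are all illustrative assumptions. The same model and the same joint distribution are explained under two opposite causal hypotheses.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n_features):
    """Exact Shapley values for a set function `value` over n_features players."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Standard Shapley weight for coalitions of this size.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def make_value_function(x, coef, rho, parent, child):
    """Toy interventional value function v(S) = E[f(X) | do(X_S = x_S)].

    Assumptions (illustrative only): f(x) = coef[0]*x[0] + coef[1]*x[1],
    both features have mean 0 and unit variance with correlation rho,
    and `parent` causes `child` through a linear relation.
    """
    def value(s):
        expected = [0.0, 0.0]              # features outside S start at their mean
        for j in s:
            expected[j] = x[j]             # intervened features are fixed to x_j
        if parent in s and child not in s:
            # Intervening on the parent shifts the child downstream.
            expected[child] = rho * x[parent]
        # Intervening on the child alone leaves the parent at its mean.
        return sum(c * e for c, e in zip(coef, expected))
    return value


x = [1.0, 2.0]        # hypothetical data point
coef = [0.5, 1.0]     # hypothetical model coefficients
rho = 0.6             # hypothetical correlation between the two features

# Hypothesis A: feature 0 causes feature 1. Hypothesis B: the reverse.
phi_a = shapley_values(make_value_function(x, coef, rho, parent=0, child=1), 2)
phi_b = shapley_values(make_value_function(x, coef, rho, parent=1, child=0), 2)
print("feature 0 -> feature 1:", phi_a)
print("feature 1 -> feature 0:", phi_b)
```

In both cases the attributions sum to the model output at x minus the baseline expectation (zero here), but the share credited to each feature shifts toward whichever feature is assumed to be the cause. This is the sense in which different causal orderings can yield different variable importances for the same fitted model.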

Causal analyses for death rates
The analysis in the main paper focuses only on understanding the causal connections between the socioeconomic metrics and the confirmed case rate. A similar analysis can be performed using the death rate as the endogenous variable. We show the results of this analysis in Figure S5 to Figure S8. We see that the landscape of the causal dependence is a bit different for the Phase I February 2020 to July 2020

Figure S10. Correlation matrices for the three regions of the USA considered in this work and for all of the USA, for Phase II of the pandemic. Panels: The East Coast, The West Coast, All of the United States, The Southern States.